Skip to content

Conversation

@daviesrob
Copy link
Member

cram_next_slice() with a region set tries to skip over containers with contents that don't overlap the region of interest. For files with a mixture of read lengths, it's possible that some decode jobs have been queued before a later container is skipped. This makes calling cram_seek() hazardous as it drops any in-flight jobs before updating the file position. Instead, call hseek() directly so already-queued decode jobs are retained.

Fixes samtools/samtools#2285 (CRAM region queries miss reads when using additional threads)

cram_next_slice() with a region set tries to skip over containers
with contents that don't overlap the region of interest.  For
files with a mixture of read lengths, it's possible that some
decode jobs have been queued before a later container is skipped.
This makes calling cram_seek() hazardous as it drops any in-flight
jobs before updating the file position.  Instead, call hseek()
directly so already-queued decode jobs are retained.
The test creates a CRAM file with records in lots of containers,
many of which should be skipped when making a region query that
expects to return the first and last two records in the file.
@jkbonfield
Copy link
Contributor

jkbonfield commented Dec 1, 2025

My initial worry with this was why do we have a cram_seek function at all when hseek would suffice? Tracking it back it goes all the way to io_lib's initial indexing implementation here which didn't have an hseek.

So it looks like there wasn't a specific concern over using system seeks. I do see that the original seek had more smarts and could do read-and-discard when operating on a pipe, which is something we lost when we switched to wrapping up hseek (deliberately: see 1f338eb). There have been other changes in the interim too, such as clearing the ooc flag (out of containers), although I note you do that too which is good. I'll investigate the other changes as I'm wondering if this is problematic elsewhere.

@jkbonfield
Copy link
Contributor

jkbonfield commented Dec 1, 2025

I tested samtools with it, and also tested the issues reported in #655 which gave rise to the initial addition of the cram_drain_rqueue call. That has a convenient test which I repeated on a clone of 1.6 to be sure I had a case which broke. The new code doesn't recreate that problem.

Looks good. Thanks

@jkbonfield jkbonfield merged commit fe1721d into samtools:develop Dec 1, 2025
9 checks passed
@daviesrob daviesrob deleted the threaded-cram-regions branch December 2, 2025 09:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

CRAM region queries miss reads when using additional threads

2 participants